Topic Trend Detection in Text Collections using Latent Dirichlet Allocation

Authors

  • Levent Bolelli
  • Seyda Ertekin
  • Ding Zhou
  • C. Lee Giles
Preprint submitted to Information Systems, 24 December 2007

Abstract

Algorithms that automatically mine distinct topics in document collections have become increasingly important because of their applications in many fields and the rapid growth in the number of documents in many domains. Traditionally, topic discovery has been addressed mainly through algorithms that work on a snapshot view of the repository, which ignores the temporal characteristics of the collection. In many collections, however, the documents are temporal in nature, and this temporal dimension can influence the topic discovery process. This paper proposes a generative model based on latent Dirichlet allocation that integrates the temporal ordering of the documents into the generative process in an iterative fashion. The document collection is divided into time segments, and the topics discovered in each segment are propagated to influence topic discovery in the subsequent time segments. We conduct experiments on a collection of academic papers from the CiteSeer repository. In addition to the textual content of the documents, we augment the text corpus with user queries and tags and integrate the citation graph to boost the weight of the topical terms. The experimental results show that the segmented topic model can effectively detect distinct topics and their evolution over time.
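The segment-by-segment propagation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how topics are carried forward, so the version below assumes one simple option: the topic-word counts learned in one time segment are scaled and added to the Dirichlet prior over words for the next segment. The function names `gibbs_lda` and `segmented_lda`, the `carry` and `beta` parameters, and the use of collapsed Gibbs sampling are all assumptions made for illustration, not the authors' implementation. Query, tag, and citation weighting are left out.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, prior, alpha=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with a per-topic, per-word prior.

    docs  : list of documents, each a list of integer word ids
    prior : prior[k][w] is the Dirichlet pseudo-count of word w in topic k
    Returns the topic-word count matrix (num_topics x vocab_size).
    """
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    prior_total = [sum(row) for row in prior]
    doc_topic = [[0] * num_topics for _ in docs]
    assign = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + prior[j][w])
                    / (topic_total[j] + prior_total[j])
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word


def segmented_lda(segments, vocab_size, num_topics, beta=0.01, carry=0.5):
    """Fit LDA on each time segment in order.

    Assumed propagation rule: the next segment's prior is the flat prior
    `beta` plus `carry` times the previous segment's topic-word counts.
    """
    prior = [[beta] * vocab_size for _ in range(num_topics)]
    history = []
    for t, docs in enumerate(segments):
        counts = gibbs_lda(docs, vocab_size, num_topics, prior, seed=t)
        history.append(counts)
        prior = [[beta + carry * c for c in row] for row in counts]
    return history


# Toy corpus: two time segments, two documents each, vocabulary of 5 word ids.
segments = [
    [[0, 1, 0, 1], [2, 3, 3, 2]],
    [[0, 1, 1, 4], [2, 3, 4, 4]],
]
history = segmented_lda(segments, vocab_size=5, num_topics=2)
```

Each entry of `history` is the topic-word count matrix for one segment, so comparing row `k` across segments gives a rough view of how topic `k` drifts over time. Because each segment inherits its prior from the previous one, topic indices tend to stay aligned between segments, although with a small `carry` that alignment is not guaranteed.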

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...


Topic Trend Detection in Text Collections using Latent Dirichlet Allocation

Algorithms that enable the process of automatically mining distinct topics in document collections have become increasingly important due to their applications in many fields and the extensive growth of the number of documents in many domains. Traditionally, the task of topic discovery has been mainly addressed through algorithms that work on a snapshot view of the documents, which ignores the ...


Topic and Trend Detection in Text Collections Using Latent Dirichlet Allocation

Algorithms that enable the process of automatically mining distinct topics in document collections have become increasingly important due to their applications in many fields and the extensive growth of the number of documents in various domains. In this paper, we propose a generative model based on latent Dirichlet allocation that integrates the temporal ordering of the documents into the gene...


Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps

Clustering and visualization of large text document collections aids in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred usin...


Using Variational Inference and MapReduce to Scale Topic Modeling

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference of LDA. In this paper, we propose a technique called MapReduce LDA (Mr. LDA) to accommodate very large corpus collections in the MapReduce framework. In contrast to other techni...


Publication date: 2007